Performance evaluation of various training data in English-Persian Statistical Machine Translation
Authors
Abstract
Globalization and the continued increase in international travel and commerce have made automatic translation systems an attractive area of research and development. Even as technology opens up e-commerce opportunities, companies must overcome language barriers to reach new potential customers and business partners. With the advent of Web 2.0 technologies, machine translation and tools like Google Translate have made the web more accessible. Machine translation is usually employed to translate text from one language into another. Statistical Machine Translation has been used for translation between many language pairs, contributing to its popularity in recent years. It has, however, not been used for the English/Persian language pair. This paper presents the first such attempt and describes the problems faced in creating a corpus and building a baseline system. We discuss our experience with constructing a parallel corpus during this ongoing study and the problems encountered, especially in the alignment process. The prototype and its evaluation are briefly described, and the results are analyzed. The final part of the paper draws conclusions and discusses planned future work.
Similar resources
The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language
Machine Translation Evaluation Metrics (MTEMs) are at the core of Machine Translation (MT) engines, as these engines are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages are still in question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...
Extracting Persian-English Parallel Sentences from Document Level Aligned Comparable Corpus using Bi-Directional Translation
Bilingual parallel corpora are very important in various fields of natural language processing (NLP). The quality of a Statistical Machine Translation (SMT) system depends strongly on the amount of training data. For low-resource language pairs such as Persian-English, there are not enough parallel sentences to build an accurate SMT system. This paper describes a new approach to use the Wiki...
Improved Language Modeling for English-Persian Statistical Machine Translation
As interaction between speakers of different languages continues to increase, the ever-present problem of language barriers must be overcome. For the same reason, automatic language translation (Machine Translation) has become an attractive area of research and development. Statistical Machine Translation (SMT) has been used for translation between many language pairs, the results of which have ...
Creation of comparable corpora for English-Urdu, Arabic, Persian
Statistical Machine Translation (SMT) relies on the availability of rich parallel corpora. However, in the case of under-resourced languages or some specific domains, parallel corpora are not readily available. This leads to under-performing machine translation systems in those sparse data settings. To overcome the low availability of parallel resources, the machine translation community has rec...
A new model for persian multi-part words edition based on statistical machine translation
Multi-part words in the English language are hyphenated, with a hyphen separating the different parts. The Persian language has multi-part words as well. According to Persian morphology, a half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use a space character instead of the half-space character. This common incorrect use of space leads to some s...